 model prediction


DPA: A One-stop Metric to Measure Bias Amplification in Classification Datasets

Neural Information Processing Systems

Most ML datasets today contain biases. When we train models on these datasets, they often not only learn these biases but can worsen them -- a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification in classification datasets. They measure bias amplification between a protected attribute (e.g., gender) and a task (e.g., cooking). These metrics also support fine-grained bias analysis by identifying the direction in which a model amplifies biases. However, co-occurrence-based metrics have limitations -- some fail to measure bias amplification in balanced datasets, while others fail to measure negative bias amplification.
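To make the idea of a co-occurrence-based measurement concrete, here is a minimal sketch. It is not DPA and not any specific published metric: the function names, the toy counts, and the choice to compare training data against model predictions by a simple difference are all assumptions made for illustration.

```python
from collections import Counter

def bias_score(pairs, task, attr):
    """Fraction of examples labeled `task` that co-occur with `attr`.

    `pairs` is a list of (task, attribute) tuples (hypothetical format).
    """
    counts = Counter(pairs)
    total = sum(c for (t, _), c in counts.items() if t == task)
    return counts[(task, attr)] / total if total else 0.0

def amplification(train_pairs, pred_pairs, task, attr):
    """Change in co-occurrence bias from the training data to the predictions.

    Positive: the model strengthens the association; negative: it weakens it.
    """
    return bias_score(pred_pairs, task, attr) - bias_score(train_pairs, task, attr)

# Toy data (made up): 66% of training "cooking" examples co-occur with "female",
# while the model predicts that pairing 84% of the time.
train = [("cooking", "female")] * 66 + [("cooking", "male")] * 34
preds = [("cooking", "female")] * 84 + [("cooking", "male")] * 16

amp = amplification(train, preds, "cooking", "female")
print(f"amplification for (cooking, female): {amp:.2f}")
```

A metric of this family that only scores pairs whose training bias exceeds 0.5 would skip a 50/50 dataset entirely, which is one way the balanced-dataset limitation mentioned in the abstract can arise.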


PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis

Neural Information Processing Systems

We introduce a comprehensive framework for modeling single cell transcriptomic responses to perturbations, aimed at standardizing benchmarking in this rapidly evolving field. Our approach includes a modular and user-friendly model development and evaluation platform, a collection of diverse perturbational datasets, and a set of metrics designed to fairly compare models and dissect their performance. Through extensive evaluation of both published and baseline models across diverse datasets, we highlight the limitations of widely used models, such as mode collapse. We also demonstrate the importance of rank metrics which complement traditional model fit measures, such as RMSE, for validating model effectiveness. Notably, our results show that while no single model architecture clearly outperforms others, simpler architectures are generally competitive and scale well with larger datasets. Overall, this benchmarking exercise sets new standards for model evaluation, supports robust model development, and furthers the use of these models to simulate genetic and chemical screens for therapeutic discovery.
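The point about rank metrics complementing RMSE can be illustrated with a small sketch. The rank definition below is an assumption chosen for simplicity and is not necessarily the one the benchmark uses; the toy vectors and the helper names `rmse` and `mean_rank` are likewise hypothetical.

```python
import math

def rmse(a, b):
    """Root mean squared error between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def mean_rank(preds, truths, tol=1e-12):
    """Average normalized rank of each prediction's own ground truth.

    For prediction i, count how many *other* ground-truth profiles are at
    least as close as the correct one (ties count against the model), then
    divide by n - 1. 0.0 is best, 1.0 is worst.
    """
    n = len(truths)
    total = 0.0
    for i, p in enumerate(preds):
        dists = [rmse(p, t) for t in truths]
        worse = sum(1 for j, d in enumerate(dists) if j != i and d <= dists[i] + tol)
        total += worse / (n - 1)
    return total / n

# Toy "perturbation effects" (made up): three distinct responses.
truths = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A collapsed model predicts the same average profile for every perturbation.
collapsed = [[1 / 3, 1 / 3, 1 / 3] for _ in truths]

# A model that gets the direction right but overshoots the magnitude.
overshoot = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]

for name, preds in [("collapsed", collapsed), ("overshoot", overshoot)]:
    avg_rmse = sum(rmse(p, t) for p, t in zip(preds, truths)) / len(truths)
    print(name, "rmse:", round(avg_rmse, 3), "rank:", mean_rank(preds, truths))
```

In this toy setup the collapsed model has the lower RMSE yet the worst possible rank score, because its predictions cannot tell perturbations apart. That is the kind of failure a fit measure alone can hide.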


Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Neural Information Processing Systems

There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) that take assistive actions toward a user's goal/query (e.g., "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multi-modal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results that evaluate several families of modern vision-language models show that larger models perform significantly better on the task, yet remain far from practical usefulness, as they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities.


2cd5737c59645f7ef23b2842b705edf2-Paper-Conference.pdf

Neural Information Processing Systems

Image classification accuracy on the ImageNet dataset has been a barometer for progress in computer vision over the last decade. Several recent papers have questioned the degree to which the benchmark remains useful to the community [33, 3, 31, 42, 36], yet innovations continue to contribute gains to performance, with today's largest models achieving 90%+ top-1 accuracy. To help contextualize progress on ImageNet and provide a more meaningful evaluation for today's state-of-the-art models, we manually review and categorize every remaining mistake that a few top models make and provide insights into the long-tail of errors on one of the most benchmarked datasets in computer vision. We focus on the multi-label subset evaluation of ImageNet, where today's best models achieve upwards of 97% top-1 accuracy. Our analysis reveals that nearly half of the supposed mistakes are not mistakes at all, and we uncover new valid multi-labels, demonstrating that, without careful review, we are significantly underestimating the performance of these models. On the other hand, we also find that today's best models still make a significant number of mistakes (40%) that are obviously wrong to human reviewers. To calibrate future progress on ImageNet, we provide an updated multi-label evaluation set, and we curate ImageNet-Major: a 68-example "major error" slice of the obvious mistakes made by today's top models -- a slice where models should achieve near perfection, but today are far from doing so.
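As a rough illustration of why newly discovered valid labels change the reported number, the sketch below scores top-1 predictions against a set of acceptable labels per image. The function name, label strings, and toy data are assumptions for illustration and are not taken from the paper's evaluation code.

```python
def multilabel_top1_accuracy(top1_preds, valid_labels):
    """Share of images whose single top-1 prediction is in that image's
    set of acceptable labels (hypothetical helper)."""
    if len(top1_preds) != len(valid_labels):
        raise ValueError("need one label set per prediction")
    correct = sum(1 for pred, labels in zip(top1_preds, valid_labels) if pred in labels)
    return correct / len(top1_preds)

# Toy predictions for four images (made up).
preds = ["bagel", "dough", "tabby", "laptop"]

# Original single-label ground truth, written as one-element sets.
single = [{"bagel"}, {"bagel"}, {"tiger cat"}, {"notebook"}]

# After review, some images gain additional valid labels.
reviewed = [{"bagel"}, {"bagel", "dough"}, {"tiger cat", "tabby"}, {"notebook"}]

print("before review:", multilabel_top1_accuracy(preds, single))
print("after review: ", multilabel_top1_accuracy(preds, reviewed))
```

In this toy case two of the three apparent mistakes disappear once the label sets are expanded, while the fourth image remains a genuine error.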


All Points Matter: Entropy-Regularized Distribution Alignment for Weakly-supervised 3D Segmentation (Liyao Tang)

Neural Information Processing Systems

This approach may, however, hinder the comprehensive exploitation of unlabeled data points. We hypothesize that this selective usage arises from the noise in pseudo-labels generated on unlabeled data. This noise may cause significant discrepancies between the pseudo-labels and the model's predictions, which confuses the model and substantially degrades training.





1 Data Ingestion

Neural Information Processing Systems

For all other remaining architectures, the reported results are from private datasets. Neck Shaft Angle (NSA) cannot be estimated. Additionally, [?] requires estimation of the diaphysis ... Figure 4: Repeatability of the femur morphometry extraction method as measured by error distributions for a) the landmarks/anatomical sizes and b) axis alignment identified by the adapted method. Do the main claims made in the abstract and introduction accurately reflect the paper's ... Did you specify all the training details (e.g., data splits, hyperparameters, how they were ... Data splits are available in the GitHub repository. Did you report error bars (e.g., with respect to the random seed after running ex... Did you include the total amount of compute and the type of resources used (e.g., ... Did you mention the license of the assets?